Symbolic Regression

Modeling Molecular Mass

Delanyce Rose & Richard Henry

2025-08-03

Outline

  • Symbolic Regression
  • Molecular Mass
  • Typical Workflow
  • Toy Dataset
  • Software Implementation
  • Noisy Toy
  • Belly of the Beast
  • Quiet Toy
  • Alternatives to Evolution
  • Data Exploration
  • Modeling and Results
  • Conclusions

Methods

Definition

  • independent variable \(X\)
  • dependent variable \(y\)
  • analytical expression \(f(X)\)
  • \[\hat{y}_{23} \approx f(X_{23})\]

Implementation

Gene icons created by Talha Dogar - Flaticon

Symbolic Regression Alternatives

  • neural networks
  • \(\frac{dy}{dX}|_{X_{23}}\approx \frac{\hat{y}_{24}-\hat{y}_{23}}{X_{24}-X_{23}}\)
  • linear regression
  • \(\hat{y}_{23} \approx \beta_0 +\beta_1 \cdot X\)
  • best form?
  • \(\hat{y}_{23} \approx \beta_2 +\beta_3 \cdot e^X\)
  • \(\hat{y}_{23} \approx \beta_4 +\beta_5 \cdot log(X)\)
  • \(log(\hat{y}_{23}) \approx \beta_6 +\beta_7 \cdot log(X)\)
  • \(\hat{y}\approx\frac{\beta_0+\beta_1x+\beta_2x^2}{\beta_3+\beta_4x}\)
  • Alternating Conditional Expectations
  • Kolmogorov-Arnold Network

Typical Workflow

  • what
  • how
  • when
  • initial population
  • fitness evaluation
  • stop ??
  • new population

Alternatives to Evolution

  • Deterministic
  • Information Technology
  • Neural Networks

Alternatives to Evolution

  • Deterministic
    • Brute Force
    • Mathematical Programming
    • Sparse Regression
  • Information Technology
    • data-centric
    • expert-centric
  • Neural Networks
    • Transformers
    • Equation-Builders

Analysis and Results

Data Exploration / Visualization

Variable Description Designation
\(Mw\) Molecular Mass dependent
\(SG\) Specific Gravity independent
\(TBP\) True Boiling Point independent
       SG              TBP               MW        
 Min.   :0.6310   Min.   : 306.0   Min.   :  70.0  
 1st Qu.:0.7439   1st Qu.: 373.2   1st Qu.:  99.0  
 Median :0.8474   Median : 584.0   Median : 222.5  
 Mean   :0.8470   Mean   : 575.2   Mean   : 304.6  
 3rd Qu.:0.8784   3rd Qu.: 668.5   3rd Qu.: 297.8  
 Max.   :1.3793   Max.   :1012.0   Max.   :1685.0  

Univariate Analysis

Raw Molecular Mass Histogram

Univariate Analysis

Raw Molecular Mass Box-and-Whiskers Plot

Univariate Analysis

Raw Boiling Point Box-and-Whiskers Plot

Univariate Analysis

Raw Boiling Point Histogram

Univariate Analysis

Raw Specific Gravity Histogram

Univariate Analysis

Raw Specific Gravity Box-and-Whiskers Plot

Bivariate Analysis

Molecular Mass vs Specific Gravity Goossens Data

Bivariate Analysis

Molecular Mass vs Boiling Point Goossens Data

Bivariate Analysis

Specific Gravity vs Boiling Point Goossens Data

Test vs. Train

Molecular Mass vs Boiling Point. Both Datasets

Test vs. Train

Molecular Mass vs Specific Gravity. Both Datasets

Existing Correlations

Symbol Meaning
\(M_w\) Apparent Molecular Mass
\(T_b\) True Boiling Point Temperature
\(\gamma_o\) Specific Gravity
\(a_{00}..a_{09}\) Empirical Constants
\(K_w\) Characterization Factor (intermediate value)
\(X_0...X_3\) Intermediate Variables

Existing Correlations

Hariu & Sage (1969)

\[ M_w = a_{00} + a_{01} K_w + a_{02} K_w^2 + a_{03} T_b K_w + a_04 T_b K_w^2 + a_{05} T_b^2 K_w + a_{06} T_b^2 K_w^2 \]

\[K_w =\frac{\sqrt[3]T_b}{\gamma_o}\]

Kesler & Lee (1976)

\[M_w = X_0 + \frac{X_1}{T_b} + \frac{X_2}{T_b^2}\]

\[X_0 = a_{00} + a_{01} γ_o+ \left (a_{02} + a_{03} γ_o \right ) T_b\]

\[ X_1 = \left (1+ a_{04} γ_o + a_{05}γ_o^2 \right ) \left (a_{06} + \frac{a_ {07}}{T_b} \right ) \cdot 10^7 \]

\[ X_2 = \left (1+ a_{08} γ_o+ a_{09} γ_o^2 \right ) \left (a_{10} + \frac{a_{11}}{T_b} \right ) \cdot 10^{12} \]

American Petroleum Institute (1977)

\[ M_w = a_{00} e^ {\left (a_{01} T_b \right )} e^{\left (a_{02} γ_o \right )} T_b^{a_{03}} γ_o^ {a_{04}} \]

Winn, Sim & Daubert (1980)

\[M_w = a_{00} T_b^{a_ {01}} γ_o^{a_{02}}\]

Riazi & Daubert (1980)

\[M_w = a_{00} T_b^{a_ {01}}γ_o^{a_{02}}\]

Rao & Bardon (1985)

\[ln {M_w} = (a_{00} + a_{01} K_w) ln (\frac{T_b} {a_{02} + a_{03} K_w} )\]

Riazi & Daubert (1987)

\[ M_w = a_{00} T_b^{a_{01}} γ_o^{a_{02}} e^{\left (a_{03} T_b + a_{04} γ_o + a_{05} T_b γ \right )} \]

Goossens (1996)

\[M_w = a_{00} T_b^{X_0}\]

\[ X_0 =\frac {a_{03} + a_{04} ln {\left (\frac{T_b} {a_{05} - T_b} \right )}} {a_{01} γ_o + a_{02}} \]

Linan (2011)

\[ M_w = a_{00} e^{\left (a_{01} T_b \right )} e^{\left (a_{02} γ_o \right )} T_b^ {a_{03}} γ_o^{a_{04}} \]

Hosseinifar & Shahverdi (2021)

\[M_w = {\left [a_{00} T_b^{a_{01}} {\left (\frac{3+2γ_o} {3-γ_o} \right )}^{\frac{a_{02}}{2}} + a_{03} T_b^{a_{04}} {\left (\frac{3+2γ_o}{3-γ_o} \right )}^{\frac{a_{05}}{2}} \right ]}^{a_{06}}\]

Stratiev (2023)

\[ M_w = a_{00} + a_{01} e^{\left [a_{02} e^{\left (a_{03} \frac{T_b^{a_{06}}}{γ_o^{a_{05}}} \right )} \right ]} \]

Operators to Consider

Partial List
Operator Type Description
pow binary one expression raised to the power of another
log unary logarithm of an expression
exp unary antilogarithm of an expression
sqr unary expression squared
cub unary expression cubed
inv unary inverse of an expression

Modeling and Results

Default Run

\[ M_w=a_{00}+\frac{a_{01}}{\gamma_o-a_{02}}+\frac{T_b\cdot (a_{03}\cdot T_b-a_{04})}{\gamma_o\cdot (a_{05}-\frac{a_{06}}{T_b})} \]

Correlation Coefficients
Raw MW Goossens
Equation
This
Equation
Raw MW 1.000000 0.999711 0.999847
Goossens Equation 0.999711 1.000000 0.999798
This Equation 0.999847 0.999798 1.000000

Q-Q Plot for the Default Run Residuals

Power Run

\[M_w=a_{00} \cdot a_{01}^{\gamma_o^{-a_{02}} + a_{03} \cdot T_b} + a_{04}\]

Correlation Coefficients
Raw MW This Equation
Raw MW 1.000000 0.997281
This Equation 0.997281 1.000000

Power Run

Exponential Run

\[ M_w= - T_b \cdot \left(a_{00} \cdot T_b - a_{01}\right) \left(a_{02} \cdot 10^{-6} \left(a_{03} \cdot \gamma_o - 2 \cdot T_b \right) \left(T_b - a_{04}\right) - 1\right) + a_{05}\]

\[ M_w= a_{00}\cdot T_b + a_{01} \cdot e^{- \gamma_o^{2} + a_{02}\cdot \gamma_o + a_{03} \cdot T_b}\]

Correlation Coefficients
Raw MW 1st
Equation
2nd
Equation
Raw MW 1.000000 0.997705 0.998420
First Equation 0.997705 1.000000 0.999497
Second Equation 0.998420 0.999497 1.000000

Raw Data Scale Change

Aeon Run

\[ M_w=(a_{00}\cdot T_b-a_{01})e^{a_{02}\cdot 10^{-9}T_b^2(a_{03}\cdot \gamma_0 +a_{04}\cdot T_b)} \]

\[ M_w=-a_{00}\cdot \gamma_o+\frac{e^{-\gamma_o^2+{log(T_b)}^3}}{T_b^{a_{01}}} +a_{02}\cdot T_b \]

Correlation Coefficients
Raw Mw First Equation Second Equation
Raw Mw 1.000000 0.998880 0.999324
First Equation 0.998880 1.000000 0.998851
Second Equation 0.999324 0.998851 1.000000

Box-Cox Run

\(\lambda=-0.3624\)

transformed molecular mass histogram

Log vs Box-Cox Transform

Box-Cox vs. Inv. Sqr. Root

transformed fit

transformed residuals

Shapiro-Wilk p-value \(0.0507\)

untransformed fit \[ M_w^t = \left (\gamma_o \left(a_{00} T_b -a_{01} \right) log{(\gamma_o)} + a_{02} \right) log{(T_b)} \]

Reverse the Box-Cox transform:

\[ M_w = {\left(\lambda M_w^t +1 \right)}^{\frac{1}{\lambda}} \]

Correlation Coefficients (Transform Reversed)
Raw Mw This Equation
Raw Mw 1.000000 0.995897
This Equation 0.995897 1.000000

Residuals outside of Box-Cox Space

First Sparse Run

\[ M_w=-a_{00} \cdot \gamma_o +\frac{a_{01}\cdot \gamma_o}{T_b}+a_{02}\cdot T_b -a_{03}\]

Correlation Coefficients
Raw Mw This Equation
Raw Mw 1.000000 0.970451
This Equation 0.970451 1.000000

First Sparse Run

Second Sparse Run

\[ M_w=-a_{00}\cdot \gamma_o\cdot T_b +a_{01}\cdot T^2_b +a_{02}\cdot e^{\gamma_o} +a_{03}\]

Correlation Coefficients
Raw Mw
Raw Mw 1.000000
1st Run Equation 0.970451
2nd Run Equation 0.985800

Second Sparse Run

Validation Runs

Run Correlation Coefficient
Raw Mw 1.000000
Default 0.999957
Goossens Correlation 0.999939
Power 0.996964
Exponential #2 0.998954
Aeon #1 0.999973
Aeon #2 0.999921
Box-Cox 0.999696
Sparse #1 0.992303
Sparse #2 0.996331

General Observations

  • UOP Characterization Factor
  • Replication
  • Explainability
  • Table Look-Up

Conclusion